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CO Abstract 

Subspace clustering refers to the task of finding a multi-subspace representation that best fits 
£S) a collection of points taken from a high-dimensional space. This paper introduces an algorithm 

inspired by sparse subspace clustering (SSC) [15] to cluster noisy data, and develops some 
novel theory demonstrating its correctness. In particular, the theory uses ideas from geometric 
functional analysis to show that the algorithm can accurately recover the underlying subspaces 
I under minimal requirements on their orientation, and on the number of samples per subspace. 

Synthetic as well as real data experiments complement our theoretical study, illustrating our 
approach and demonstrating its effectiveness. 

J Keywords. Subspace clustering, spectral clustering, LASSO, Dantzig selector, t\ minimiza- 

c/3 tion, multiple hypothesis testing, true and false discoveries, geometric functional analysis, nonasymp- 

i totic random matrix theory. 
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1.1 Motivation 



In many problems across science and engineering, a fundamental step is to find a lower dimensional 
subspace which best fits a collection of points taken from a high-dimensional space; this is classically 
achieved via Principal Component Analysis (PC A). Such a procedure makes perfect sense as long 
as the data points are distributed around a lower dimensional subspace, or expressed differently, as 
J> long as the data matrix with points as column vectors has approximately low rank. A more general 

model might sometimes be useful when the data come from a mixture model in which points do not 
lie around a single lower- dimensional subspace but rather around a union of subspaces. For instance, 
consider an experiment in which gene expression data are gathered on many cancer cell lines with 
unknown subsets belonging to different tumor types. One can imagine that the expressions from 
each cancer type may span a distinct lower dimensional subspace. If the cancer labels were known 
in advance, one would apply PCA separately to each group but we here consider the case where 
the observations are unlabeled. Finding the components of the mixture and assigning each point to 
a fitted subspace is called subspace clustering. Even when the mixture model holds, the full data 
matrix may not have low rank at all, a situation which is very different from that where PCA is 
applicable. 
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In recent years, numerous algorithms have been developed for subspace clustering and applied to 
various problems in computer vision/machine learning [40] and data mining [31]. At the time of this 
writing, subspace clustering techniques are certainly gaining momentum as they begin to be used in 
fields as diverse as identification and classification of diseases [27], network topology inference [17], 
security and privacy in recommender systems [44], system identification [3], hyper-spectral imaging 
[14], identification of switched linear systems [25,29], and music analysis [19] to name just a few. In 
spite of all these interesting works, tractable subspace clustering algorithms either lack a theoretical 
justification, or are guaranteed to work under restrictive conditions rarely met in practice. (We 
note that although novel and often efficient clustering techniques come about all the time, there 
seems to be very little theory about clustering in general.) Furthermore, proposed algorithms are 
not always computationally tractable. Thus, one important issue is whether tractable algorithms 
that can (provably) work in less than ideal situations — that is, under severe noise conditions and 
relatively few samples per subspace — exist. 

Elhamifar and Vidal [15] have introduced an approach to subspace clustering, which relies on 
ideas from the sparsity and compressed sensing literature, please see also the longer version [16] 
which was submitted while this manuscript was under preparation. Sparse subspace clustering 
(SSC) [15, 16] is computationally efficient since it amounts to solving a sequence of l\ minimiza- 
tion problems and is, therefore, tractable. Now the methodology in [15] is mainly geared towards 
noiseless situations where the points lie exactly on lower dimensional planes, and theoretical per- 
formance guarantees in such circumstances are given under restrictive assumptions. Continuing on 
this line of work, [34] showed that good theoretical performance could be achieved under broad 
circumstances. However, the model supporting the theory in [34] is still noise free. 

This paper considers the subspace clustering problem in the presence of noise. We introduce 
a tractable clustering algorithm, which is a natural extension of SSC, and develop rigorous theory 
about its performance. In a nutshell, we propose a statistical mixture model to represent data 
lying near a union of subspaces, and prove that in this model, the algorithm is effective as long as 
there are sufficiently many samples from each subspace and that the subspaces are not too close to 
each other. In this theory, the performance of the algorithm is explained in terms of interpretable 
and intuitive parameters such as (1) the values of the principal angles between subspaces, (2) the 
number of points per subspace, (3) the noise level and so on. In terms of these parameters, our 
theoretical results indicate that the performance of the algorithm is in some sense near the limit of 
what can be achieved by any algorithm, regardless of tractability. 

1.2 Problem formulation and model 

We assume we are given data points lying near a union of unknown linear subspaces; there are 
L subspaces Si, S2, ■ ■ ■ , <Sx of R n of dimensions di, cfe, . . . , (II- These together with their number 
are completely unknown to us. We are given a point set y c R n o£ cardinality N, which may be 
partitioned as 3^ = 3^1 u 3^2 u • • • u 3 ? l; for each £ € {1, . . . ,L}, 3^ is a collection of vectors that 
are 'close' to subspace Si. The goal is to approximate the underlying subspaces using the point set 
y. One approach is first to assign each data point to a cluster, and then estimate the subspaces 
representing each of the groups with PCA. 

Our statistical model assumes that each point y e y is of the form 



where x belongs to one of the subspaces and z is an independent stochastic noise term. We 



y = x + z, 
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suppose that the inverse signal-to-noise ratio (SNR) defined as E H^Hl/ll 33 !^ is bounded above. 
Each observation is thus the superposition of a noiseless sample taken from one of the subspaces 
and of a stochastic perturbation whose Euclidean norm is about a times the signal strength so that 
E 1 z || = c 2 1| x || | . All the way through, we assume that 

Tl 

a < a* , and maxd/ < en ^, (1-2) 

* (logiV) 2 

where a* < 1 and cq are fixed numerical constants. The second assumption is here to avoid unnec- 
essarily complicated expressions later on. While more substantial, the first is not too restrictive 
since it just says that the signal x and the noise z may have about the same magnitude. (With 
an arbitrary perturbation of Euclidean norm equal to two, one can move from any point x on the 
unit sphere to just about any other point.) 

This is arguably the simplest model providing a good starting point for a theoretical investiga- 
tion. For the noiseless samples x, we consider the intuitive semi-random model introduced in [34], 
which assumes that the subspaces are fixed with points distributed uniformly at random on each 
subspace. One can think of this as a mixture model where each component in the mixture is a 
lower dimensional subspace. (One can extend the methods to affine subspace clustering as briefly 
explained in Section 2.) 



1.3 What makes clustering hard? 

Two important parameters fundamentally affect the performance of subspace clustering algorithms: 
(1) the distance between subspaces and (2) the number of samples we have on each subspace. 



1.3.1 Distance/affinity between subspaces 

Intuitively, any subspace clustering algorithm operating on noisy data will have difficulty segmenting 
observations when the subspaces are close to each other. We of course need to quantify closeness, 
and the following definition captures a notion of distance or similarity /affinity between subspaces. 

Definition 1.1 The principal angles 6^\..., Q( dAd ) between two subspaces S and S' of dimensions 
d and d' , are recursively defined by 

cos{v y ') = maxmax 



ueS veS' \\u\\£ \\v\\g ||ttj 



with the orthogonality constraints u T Uj = 0, v T Vj = 0, j = 1, 1. 

Alternatively, if the columns of U and V are orthobases for S and 5", then the cosine of the 
principal angles are the singular values of U T V. 



Definition 1.2 The normalized affinity between two subspaces is defined by 

aff(5, S') = 



cos 



2 0(1) + . . . + cos 2 e( dAd ') 



dAd' 
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The affinity is a measure of correlation between subspaces. It is low when the principal angles are 
nearly right angles (it vanishes when the two subspaces are orthogonal) and high when the principal 
angles are small (it takes on its maximum value equal to one when one subspace is contained in the 
other). Hence, when the affinity is high, clustering is hard whereas it becomes easier as the affinity 
decreases. Ideally, we would like our algorithm to be able to handle higher affinity values — as close 
as possible to the maximum possible value. 

There are other ways of measuring the affinity between subspaces; for instance, by taking the 
cosine of the first principal angle. We prefer the definition above as it offers the flexibility of allowing 
for some principal angles to be small or zero. As an example, suppose we have a pair of subspaces 
with a nontrivial intersection. Then Icos^ 1 ^ = 1 regardless of the dimension of the intersection 
whereas the value of the affinity would depend upon this dimension. 



1.3.2 Sampling density 

Another important factor affecting the performance of subspace clustering algorithms has to do 
with the distribution of points on each subspace. In the model we study here, this essentially 
reduces to the number of points we have on each subspace. 1 

Definition 1.3 The sampling density p of a subspace is defined as the number of samples on that 
subspace per dimension. In our multi- sub space model the density of Si is, therefore, pe = Ni/de. 2 

With noisy data, one expects the clustering problem to become easier as the sampling density 
increases. Obviously, if the sampling density of a subspace S is smaller than one, then any algorithm 
will fail in identifying that subspace correctly as there are not sufficiently many points to identify 
all the directions spanned by S. Hence, we would like a clustering algorithm to be able to operate 
at values of the sampling density as low as possible, i.e. as close to one as possible. 



2 Robust subspace clustering: methods and concepts 

This section introduces our methodology through heuristic arguments confirmed by numerical ex- 
periments. Section 3 presents theoretical guarantees showing that the entire approach is math- 
ematically valid. Prom now on, we arrange the N observed data points as columns of a matrix 
Y = [yi, . . . , y N ] e R nxN . With obvious notation, Y = X + Z. 



2.1 The normalized model 

In practice, one may want to normalize the columns of the data matrix so that for all i, \\yi\\e 2 - 1- 
Since with our SNR assumption, we have \\y\\e 2 ~ \\ x \\e2 

+ a 2 before normalization, then after 

normalization: 

y * 7= — A x + z), 

Vl + o 1 

where x is unit-normed, and z has i.i.d. random Gaussian entries with variance a 2 /n. 



In a general deterministic model where the points have arbitrary orientations on each subspace, we can imagine that 
the clustering problem becomes harder as the points align along an even lower dimensional structure. 
2 Throughout, we take pi < e de ^ 2 . Our results hold for all other values by substituting pi with pi A e dl ^ 2 in all the 
expressions. 
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For ease of presentation, we work — in this section and in the proofs — with a model y - x + z 
in which \x\i 2 = 1 instead of \\y\\i 2 = 1 (the numerical Section 6 is the exception). The normalized 
model with \\x\\i 2 = 1 and z i.i.d. M(0,a 2 /n) is nearly the same as before. In particular, all of our 
methods and theoretical results in Section 3 hold with both models in which either \\x\\i 2 = 1 or 

\\y\W = L 

2.2 The SSC scheme 

We describe the approach in [15], which follows a three-step procedure: 

I Compute an affinity matrix encoding similarities between sample pairs as to construct a 
weighted graph W. 

II Construct clusters by applying spectral clustering techniques (e.g. [28]) to W. 

Ill Apply PCA to each of the clusters. 

The novelty in [15] concerns step I, the construction of the affinity matrix. This work is mainly 
concerned with the noiseless situation in which Y - X and the idea is then to express each column 
Xi of X as a sparse linear combination of all the other columns. The reason is that under any 
reasonable condition, one expects that the sparsest representation of Xi would only select vectors 
from the subspace in which xi happens to lie in. Applying the l\ norm as the convex surrogate of 
sparsity leads to the following sequence of optimization problems 

min ||/3L subject to Xi = X(3 and $ = 0. (2.1) 

/3eR N 1 

Here, denotes the ith element of (3 and the constraint = removes the trivial solution that 
decomposes a point as a linear combination of itself. Collecting the outcome of these N optimization 
problems as columns of a matrix B, [15] sets the similarity matrix to be W = \B\ + \B\ T (This 
algorithm clusters linear subspaces but can also cluster affine subspaces by adding the constraint 
(3 T 1 = 1 to (2.1).) 

The issue here is that we only have access to the noisy data Y: this makes the problem 
challenging, as unlike conventional sparse recovery problems where only the response vector X{ is 
corrupted, here both the covariates (columns of X) and the response vector are corrupted. In 
particular, it may not be advisable to use (2.1) with yi and Y in place of X{ and X as, strictly 
speaking, sparse representations no longer exist. Observe that the expression a?j = X(3 can be 
rewritten as yi = Y (3 + (z{ - Z(3). Viewing (z, - Z(3) as a perturbation, it is natural to use ideas 
from sparse regression to obtain an estimate (3, which is then used to construct the similarity 
matrix. In this paper, we follow the same three-step procedure and shall focus on the first step in 
Algorithm 1; that is, on the construction of reliable similarity measures between pairs of points. 
Since we have noisy data, we shall not use (2.1) here. Also, we add denoising to Step III, check the 
output of Algorithm 1. 

2.3 Performance metrics for similarity measures 

Given the general structure of the method, we are interested in sparse regression techniques, which 
tend to select points in the same clusters (share the same underlying subspace) over those that do 

3 We use the terminology similarity graph or matrix instead of affinity matrix as not to overload the word 'affinity'. 
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Algorithm 1 Robust SSC procedure 



Input: A data set 3^ arranged as columns of Y e R n . 

1. For each i e {1, . . . ,N}, produce a sparse coefficient sequence j3 by regressing the ith vector 
Ui onto the other columns of Y. Collect these as columns of a matrix B. 

2. Form the similarity graph G with nodes representing the N data points and edge weights 
given by W = \B\ + \B\ . 

3. Sort the eigenvalues <5i > 62 > ■ ■ ■ > Sn of the normalized Laplacian of G in descending order, 
and set 

L = N — argmax (5{ - 
i=l,...,JV-l 

4. Apply a spectral clustering technique to the similarity graph using L as the estimated number 
of clusters to obtain the partition yi,...,y^. 

5. Use PCA to find the best subspace fits ({S^lf ) to each of the partitions ({3^}f ) and denoise 
Y as to obtain clean data points X . 

Output: Subspaces {Se}i and cleaned data points X. 



not share this property. Expressed differently, the hope is that whenever Bij + 0, yi and yj belong 
to the same subspace. We introduce metrics to quantify performance. 

Definition 2.1 (False discoveries) Fix i and j e {1, . . . , N} and let B be the outcome of Step 
1 in Algorithm 1. Then we say that (i,j) obeying t is a false discovery if yi and yj do not 
belong to the same subspace. 

Definition 2.2 (True discoveries) In the same situation, obeying Bij t is a true discov- 

ery if yj and y^ belong to the same cluster/ subspace. 

When there are no false discoveries, we shall say that the subspace detection property holds. In this 
case, the matrix B is block diagonal after applying a permutation which makes sure that columns in 
the same subspace are contiguous. In some cases, the sparse regression method may select vectors 
from other subspaces and this property will not hold. However, it might still be possible to detect 
and construct reliable clusters by applying steps 2-5 in Algorithm 1, 

2.4 LASSO with data-driven regularization 

A natural sparse regression strategy is the LASSO: 

min -\\vi-Yp\l +A||/3|L subject to ft - 0. (2.2) 

Whether such a methodology should succeed is unclear as we are not under a traditional model for 
both the response yi and the covariates Y are noisy, see [32] for a discussion of sparse regression 
under matrix uncertainty and what can go wrong. The main contribution of this paper is to show 
that if one selects A in a data-driven fashion, then compelling practical and theoretical performance 
can be achieved. 
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2.4.1 About as many true discoveries as dimension? 

The nature of the problem is such that we wish to make few false discoveries (and not link too 
many pairs belonging to different subspaces) and so we would like to choose A large. At the same 
time, we wish to make many true discoveries, whence a natural trade off. The reason why we need 
many true discoveries is that spectral clustering needs to assign points to the same cluster when 
they indeed lie near the same subspace. If the matrix B is too sparse, this will not happen. 

We now introduce a principle for selecting the regularization parameter. Suppose we have 
noiseless data so that Y = X, and thus solve (2.1) with equality constraints. Under our model, 
assuming there are no false discoveries the optimal solution is guaranteed to have exactly d — the 
dimension of the subspace the sample under study belongs to — nonzero coefficients with probability 
one. That is to say, when the point lies in a d-dimensional space, we find d 'neighbors'. 

The selection rule we shall analyze in this paper is to take A as large as possible (as to prevent 
false discoveries) while making sure that the number of true discoveries is also on the order of the 
dimension d, typically in the range [0.5cf, 0.8(f]. We can say this differently. Imagine that all the 
points lie in the same subspace of dimension d so that every discovery is true. Then we wish to 
select A in such a way that the number of discoveries is a significant fraction of d, the number 
one would get with noiseless data. Which value of A achieves this goal? We will see in Section 

2.4.2 that the answer is around l/\/d. To put this in context, this means that we wish to select a 
regularization parameter which depends upon the dimension d of the subspace our point belongs 
to. (We are aware that the dependence on d is unusual as in sparse regression the regularization 
parameter usually does not depend upon the sparsity of the solution.) In turn, this immediately 
raises another question: since d is unknown, how can we proceed? In Section 2.4.4 we will see that 
it is possible to guess the dimension and construct fairly reliable estimates. 

2.4.2 Data-dependent regularization 

We now discuss values of A obeying the demands formulated in the previous section. Our arguments 
are informal and we refer the reader to Section 3 for rigorous statements and to Section 8 for proofs. 
First, it simplifies the discussion to assume that we have no noise (the noisy case assuming a « 1 is 
similar). Following our earlier discussion, imagine we have a vector x e R n lying in the ci-dimensional 
span of the columns of an n x N matrix X. We are interested in values of A so that the minimizer 
of the LASSO functional 

K{(3,\) = \\\x-X(3\l + \\\(3\\ h 

has a number of nonzero components in the range [0.5d, 0.8d], say. Now let (3 eq be the solution of 
the problem with equality constraints, or equivalently of the problem above with A -> + . Then 

~ \\x - X0f i2 < K0, A) < K($ cq , A) = A |/3 cq | £i . (2.3) 

We make two observations: the first is that if (3 has a number of nonzero components in the 
range [0.5d, 0.8cZ] , then \\x - Xf3\\g 2 has to be greater than or equal to a fixed numerical constant. 
The reason is that we cannot approximate to arbitrary accuracy a generic vector living in a d- 
dimensional subspace as a linear combination of about d/2 elements from that subspace. The 
second observation is that ||/3 e qlUi is ° n the order of \/d, which is a fairly intuitive scaling (we have 
d coordinates, each of size about l/\/d). This holds with the proviso that the algorithm operates 
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correctly in the noiseless setting and does not select columns from other subspaces. Then (2.3) 
implies that A has to scale at least like 1/vd. On the other hand, = if A > || X^H^ . Now the 
informed reader knows that || X T x\i ao scales at most like v/ (log N)/d so that choosing A around 
this value yields no discovery (one can refine this argument to show that A cannot be higher than a 
constant times \\\fd as we would otherwise have a solution that is too sparse). Hence, A is around 
1/Vd. 

It might be possible to compute a precise relationship between A and the expected number of 
true discoveries in an asymptotic regime in which the number of points and the dimension of the 
subspace both increase to infinity in a fixed ratio by adapting ideas from [5,6]. We will not do so 
here as this is beyond the scope of this paper. Rather, we investigate this relationship by means of 
a numerical study. 

Here, we fix a single subspace in R n with n = 2, 000. We use a sampling density equal to 
p - 5 and vary the dimension d e {10,20,50,100,150,200} of the subspace as well as the noise 
level a e {0.25,0.5}. For each data point, we solve (2.2) for different values of A around the 
heuristic A Q = l/\/d, namely, A e [0.1A o ,2A o ]. In our experiments, we declare a discovery if an 
entry in the optimal solution exceeds 10~ 3 . Figures la and lb show the number of discoveries per 
subspace dimension (the number of discoveries divided by d). One can clearly see that the curves 
corresponding to various subspace dimensions stack up on top of each other, thereby confirming 
that a value of A on the order of l/\/d yields a fixed fraction of true discoveries. Further inspection 
also reveals that the fraction of true discoveries is around 50% near A = A Q , and around 75% near 
A = A /2. 

2.4.3 The false-true discovery trade off 

We now show empirically that in our model choosing A around \\\fd typically yields very few false 
discoveries as well as many true discoveries; this holds with the proviso that the subspaces are of 
course not very close to each other. 

In this simulation, 22 subspaces of varying dimensions in IR n with n = 2,000 have been in- 
dependently selected uniformly at random; there are 5, 4, 3, 4, 4, and 2 subspaces of respective 
dimensions 200, 150, 100, 50, 20 and 10. This is a challenging regime since the sum of the subspace 
dimensions equals 2, 200 and exceeds the ambient dimension (the clean data matrix X has full 
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(c) (d) 

Figure 2: Performance of LASSO for values of A in an interval including the heuristic 
A = l/y/d. (a) Average number of false discoveries normalized by (n - d) (FPR) 
on all m sampled data points), (b) FPR for different subspace dimensions. Each 
curve represents the average FPR over those samples originating from subspaces of 
the same dimension, (c) Average number of true discoveries per dimension for various 
dimensions (TPR). (d) TPR vs. FPR (ROC curve). The point corresponding to A = A„ 
is marked as a red dot. 



rank). We use a sampling density equal to p = 5 for each subspace and set the noise level to a = 0.3. 
To evaluate the performance of the optimization problem (2.2), we proceed by selecting a subset of 
columns as follows: for each dimension, we take 100 cases at random belonging to subspaces of that 
dimension. Hence, the total number of test cases is m = 600 so that we only solve m optimization 
problems (2.2) out of the total N possible cases. Below, is the solution to (2.2) and /3^ } its 

(i) 

restriction to columns with indices in the same subspace. Hence, a nonvanishing entry in (3 S is 

(i) 

a true discovery and, likewise, a nonvanishing entry in /3^ c is false. For each data point we sweep 
the tuning parameter A in (2.2) around the heuristic A = \\\fd and work with A e [0.05A o , 2.5A ]. 
In our experiments, a discovery is a value obeying \Bij\ > 10~ 3 . 

In analogy with the signal detection literature we view the empirical averages of \\/3g} \\t /(n-d) 

and ||/3o 9\\i /d as False Positive Rate (FPR) and True Positive Rate (TPR). On the one hand, 
Figures 2a and 2b show that for values around A = A G , the FPR is zero (so there are no false 
discoveries). On the other hand, Figure 2c shows that the TPR curves corresponding to different 
dimensions are very close to each other and resemble those in Figure 1 in which all the points 
belong to the same cluster with no opportunity of making a false discovery. Hence, taking A near 
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gives a performance close to what can be achieved in a noiseless situation. That is to say, we 
have no false discovery and a number of true discoveries about d/2 if we choose A = A . Figure 2d 
plots TPR versus FPR (a.k.a. the Receiver Operating Characteristic (ROC) curve) and indicates 
that A = A (marked by a red dot) is an attractive trade-off as it provides no false discoveries and 
sufficiently many true discoveries. 

2.4.4 A two-step procedure 

Returning to the selection of the regularization parameter, we would like to use A on the order of 
l/y/d. However, we do not know d and proceed by substituting an estimate. In the next section, 
we will see that we are able to quantify theoretically the performance of the following proposal: (1) 
run a hard constrained version of the LASSO and use an estimate d of dimension based on the i\ 
norm of the fitted coefficient sequence; (2) impute a value for A constructed from d. The two-step 
procedure is explained in Algorithm 2. 



Algorithm 2 Two-step procedure with data-driven regularization 



for i = 1 , . . . 

1. Solve 


,N do 

a* 


= arg min 11/3 L subject to \\yi~ 


Y3\\ h <r and A = 0. 


(2.4) 


2. Set A = 

3. Solve 


fiWW 


«x). 

= arg min 1 \\ Vi - Yf3\\l + A ||/3|| 4 


subject to &i = 0. 




4. Set Bi 
end for 


= $■ 









To understand the rationale behind this, imagine we have noiseless data — i. 6.1^ = X — and are 
solving (2.1), which simply is our first step (2.4) with the proviso that r = 0. When there are no 
false discoveries, one can show that the l\ norm of 3* is roughly of size \fd as shown in Lemma 
8.2 from Section 8. This suggests using a multiple of ||/3*||^ 1 as a proxy for \fd. To drive this point 
home, take a look at Figure 3a which solves (2.4) with the same data as in the previous example 
and r = 2<7. The plot reveals that the values of 1/3*1^ fluctuate around \fd. This is shown more 
clearly in Figure 3b, which shows that 11/3*11^ is concentrated around \\fd with, as expected, higher 
volatility at lower values of dimension. 

Under suitable assumptions, we shall see in Section 3 that with noisy data, there are simple 
rules for selecting r that guarantee, with high probability, that there are no false discoveries. To 
be concrete, one can take r = 2<r and f(t) oc Returning to our running example, we have 
~ iV^- Plugging this into A = \\\fd suggests taking f(t) w 0.25/T 1 . The plots in Figure 
4 demonstrate that this is indeed effective. Experiments in Section 6 indicate that this is a good 
choice on real data as well. 

The two-step procedure requires solving two LASSO problems for each data point and is useful 
when there are subspaces of large dimensions (in the hundreds, say) and some others of low- 
dimensions (three or four, say). In some applications such as motion segmentation in computer 
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100 200 300 400 500 600 100 200 300 400 500 600 



(a) (b) 

Figure 3: Optimal values of (2.4) for 600 samples using r = 2a. The first 100 values 
correspond to points originating from subspaces of dimension d = 200, the next 100 
from those of dimension d = 150, and so on through d e {100,50,20, 10}. (a) Value of 
\\(3*\\ ti . (b) Value of \\P*\\ t jVd. 

vision, the dimensions of the subspaces are all equal and known in advance. In this case, one can 
forgo the two-step procedure and simply set A = 1/ \fd. 

3 Theoretical Results 

This section presents our main theoretical results concerning the performance of the two-step pro- 
cedure (Algorithm 2). We make two assumptions: 

• Affinity condition. We say that a subspace Si obeys the affinity condition if 

max aff(S^, Sk) < ^o/log N, (3-1) 
k ■■ k*i 

where kq a fixed numerical constant. 

• Sampling condition. We say that subspace Sg obeys the sampling condition if 

P ap\ (3.2) 

where p* is a fixed numerical constant. 

The careful reader might argue that we should require lower affinity values as the noise level 
increases. The reason why a does not appear in (3.1) is that we assumed a bounded noise level. 
For higher values of a, the affinity condition would read as in (3.1) with a right-hand side equal to 

k-o I dt 

K - CT\ . 

log AT V 2nlogA 

3.1 Main results 

From here on we use d{i) to refer to the dimension of the subspace the vector yi originates from. 
N{i) and p(i) are used in a similar fashion for the number and density of points on this subspace. 
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(a) (b) 




FPR 



(c) (d) 

Figure 4: Performance of the two-step procedure using t - 2a and f(t) = aot' 1 for 
values of ao around the heuristic ao = 0.25. (a) False positive rate (FPR). (b) FPR for 
various subspace dimensions, (c) True positive rate (TPR). (d) TPR vs. FPR. 
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Theorem 3.1 (No false discoveries) Assume that the subspace attached to the ith column obeys 
the affinity and sampling conditions and that the noise level a is bounded as in (1.2), where a* is 
a sufficiently small numerical constant. In Algorithm 2, take t - 2a and f{f) obeying f{t) > 
0. 7 07 at~ 1 . Then with high probability, 4 there is no false discovery in the ith column of B. 

Theorem 3.2 (Many true discoveries) Consider the same setup as in Theorem 3.1 with f also 
obeying f(t) < aot^ 1 for some numerical constant ckq. Then with high probability, 5 there are at least 

log p(i) 

true discoveries in the ith column (cq is a positive numerical constant). 

The above results indicate that the algorithm works correctly in fairly broad conditions. To give 
an example, assume two subspaces of dimension d overlap in a smaller subspace of dimension s but 
are orthogonal to each other in the remaining directions (equivalently, the first s principal angles 
are and the rest are vr/2). In this case, the affinity between the two subspaces is equal to \fs~fd 
and (3.1) allows s to grow almost linearly in the dimension of the subspaces. Hence, subspaces can 
have intersections of large dimensions. In contrast, previous work with perfectly noiseless data [15] 
would impose to have a first principal angle obeying Icos^ 1 ^ < l/\/d so that the subspaces are 
practically orthogonal to each other. Whereas our result shows that we can have an average of the 
cosines practically constant, the condition in [15] asks that the maximum cosine be very small. 

In the noiseless case, [34] showed that when the sampling condition holds and 

max &tt{S e ,S k ) < k - — , 

k:k*t logiV 

(albeit with slightly different values kq and p*), then applying the noiseless version (2.1) of the 
algorithm also yields no false discoveries. Hence, with the proviso that the noise level is not too 
large, conditions under which the algorithm is provably correct are essentially the same. 

Earlier, we argued that we would like to have, if possible, an algorithm provably working at 
(1) high values of the affinity parameters and (2) low values of the sampling density as these are 
the conditions under which the clustering problem is challenging. (Another property on the wish 
list is the ability to operate properly with high noise or low SNR and this is discussed next.) In 
this context, since the affinity is at most one, our results state that the affinity can be within a log 
factor from this maximum possible value. The number of samples needed per subspace is minimal 
as well. That is, as long as the density of points on each subspace is larger than a constant p> p* , 
the algorithm succeeds. 

We would like to have a procedure capable of making no false discoveries and many true discov- 
eries at the same time. Now in the noiseless case, whenever there are no false discoveries, the ith. 
column contains exactly d{i) true discoveries. Theorem 3.2 states that as long as the noise level a is 
less than a fixed numerical constant, the number of true discoveries is roughly on the same order as 
in the noiseless case. In other words, a noise level of this magnitude does not fundamentally affect 
the performance of the algorithm. This holds even when there is great variation in the dimensions 
of the subspaces, and is possible because A is appropriately tuned in an adaptive fashion. 



4 probability at least 1 - 2e 11Tl - 6e~ 72d(i) - e ^ N(l)d(i) - 3f , for fixed numerical constants j x , -y 2 - 
probability at least 1 - 2e" 71 ™ - 6e~ 72d(i) - e~ VJV(i)dW - for fixed numerical constants 71, 72. 
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Figure 5: Histograms of the true discovery values from the two step procedure with 
a = 0.25 (multiplied by \fd). (a) d = 200. (b) d = 20. 



The number of true discoveries is shown to scale at least like dimension over the log of the 
density. This may suggest that the number of true discoveries decreases (albeit very slowly) as the 
sampling density increases. This behavior is to be expected: when the sampling density becomes 
exponentially large (in terms of the dimension of the subspace) the number of true discoveries 
become small since we need fewer columns to synthesize a point. In fact, the d/logp behavior 
seems to be the correct scaling. Indeed, when the density is low and p takes on a small value, (3.3) 
asserts that we make on the order of d discoveries, which is tight. Imagine now that we are in the 
high-density regime and p is exponential in d. Then as the points gets tightly packed, we expect 
to have only one discovery in accordance with (3.3). 

Theorem 3.2 establishes that there are many true discoveries. This would not be useful for 
clustering purposes if there were only a handful of very large true discoveries and all the others of 
negligible magnitude. The reason is that the similarity matrix W would then be close to a sparse 
matrix and we would run the risk of splitting true clusters. This is not what happens and our proofs 
can show this although we do not do this in this paper for lack of space. Rather, we demonstrate 
this property empirically. On our running example, Figures 5a and 5b show that the histograms of 
appropriately normalized true discovery values resemble a bell-shaped curve. 

Finally, we would like to comment on the fact that our main results hold when A belongs to a 
fairly broad range of values. First, when all the subspaces have small dimensions, one can choose 
the same value of A for all the data points since \\\fd is essentially constant. Hence, when we know 
a priori that we are in such a situation, there may be no need for the two step procedure. (We would 
still recommend the conservative two-step procedure because of its superior empirical performance 
on real data.) Second, the proofs also reveal that if we have knowledge of the dimension of the 
largest subspace d max , the first theorem holds with a fixed value of A proportional to cr/vdmax- 
Third, when the subspaces themselves are drawn at random, the first theorem holds with a fixed 
value of A proportional to a (log N)/y/ri. (Both these statements follow by plugging these values 
of A in the proofs of Section 8 and we omit the calclulations.) We merely mention these variants 
to give a sense of what our theorems can also give. As explained earlier, we recommend the more 
conservative two-step procedure with the proxy for 1/y/d. The reason is that using a higher value of 
A allows for a larger value of Kq, which says that the subspaces can be even closer. In other words, 
we can function in a more challenging regime. To drive this point home, consider the noiseless 
problem. When the subspaces are close, the equality constrained t\ problem may yield some false 
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discoveries. However, if we use the LASSO version — even though the data is noiseless — we may 
end up with no false discoveries while maintaining sufficiently many true discoveries. 



4 The Bias-corrected Dantzig Selector 

One can think of other ways of performing the first step in Algorithm 1 and this section discusses 
another approach based on a modification of the Dantzig selector, a popular sparse regression 
technique [12]. Unlike the two-step procedure, we do not claim any theoretical guarantees for this 
method and shall only explore its properties on real and simulated data. 
Applied directly to our problem, the Dantzig selector takes the form 

min \\pt tl subject to {{Y^fa - Yp)\ u < A and A = 0, (4.1) 

where is Y with the ith column deleted. However, this is hardly suitable since the design 

matrix Y is corrupted. Interestingly, recent work [32,33] has studied the problem of estimating a 
sparse vector from the standard linear model under uncertainty in the design matrix. The setup 
in these papers is close to our problem and we propose a modified Dantzig selection procedure 
inspired but not identical to the methods set forth in [32,33]. 

4.1 The correction 

If we had clean data, we would solve (2.1); this is (4.1) with Y = X and A = 0. Let (3 1 be the solution 
to this ideal noiseless problem. Applied to our problem, the main idea in [32,33] would be to find 
a formulation that resembles (4.1) with the property that (3 1 is feasible. Since X{ = X(_j)/3 7 \-{\, 
observe that we have the following decomposition: 

Y^iyi - Y0 1 ) = (X H) + Z H) ) T (z, - Ztf) 

= *H)(*i ~ ZP 1 ) + z f-i) z i ~ ZUZP 1 - 
Then the conditional mean is given by 

EtY^Cw - V/3 7 ) | X] = -E Zf^Z^p 1 ^ = -a 2 p\_ i} . 

In other words, 

where £ has mean zero. In Section 4.2, we compute the variance of the jth component £j, given by 




Owing to our Gaussian assumptions, |£j| shall be smaller than 3 or 4 times this standard deviation, 
say, with high probability. 

Hence, we may want to consider a procedure of the form 

min ||/3|L subject to \\Y ( T i) (y l -Yp) + a 2 p { _ i) \\e oa <X and ft = 0. (4.3) 
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Figure 6: Performance of the bias-corrected Dantzig selector for values of A that are 
multiples of the heuristic A G = \j2\na\f\ + a 2 , (a) False positive rate (FPR). (b) FPR 
for different subspace dimensions, (c) True positive rate (TPR). (d) TPR vs. FPR. 



It follows that if we take A to be a reasonable multiple of (4.2), then /3 1 would obey the constraint 
in (4.3) with high probability. Hence, we would need to approximate the variance (4.2). Numerical 
simulations together with asymptotic calculations presented in Appendix C give that H/^H^ < 1 
with very high probability. Thus neglecting the term in ((3j) 2 , 

Eel--(l^ 2 )(l + ll« 2 )<2^(l + a 2 ). 
J n n 

This suggests taking A to be a multiple of \j2\no yl + a 2 . This is interesting because the parameter 
A does not depend on the dimension of the underlying subspace. We shall refer to (4.3) as the bias- 
corrected Dantzig selector, which resembles the proposal in [32,33] for which the constraint is a bit 
more complicated and of the form _ Y/3) + D(-i)P\i^ < H/^IUi + A. 

To get a sense about the validity of this proposal, we test it on our running example by varying 
A e [A ,8A ] around the heuristic A Q = \Jl\na \J\ + a 2 . Figure 6 shows that good results are 
achieved around factors in the range [4,6]. 

In our synthetic simulations, both the two-step procedure and the corrected Dantzig selector 
seem to be working well in the sense that they yield many true discoveries while making very few 
false discoveries, if any. Comparing Figures 6b and 6c with those from Section 2 show that the 
corrected Dantzig selector has more true discoveries for subspaces of small dimensions (they are 
essentially the same for subspaces of large dimensions); that is, the two-step procedure is more 
conservative when it comes to subspaces of smaller dimensions. As explained earlier this is due to 
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our conservative choice of A resulting in a TPR about half of what is obtained in a noiseless setting. 
Having said this, it is important to keep in mind that in these simulations the planes are drawn 
at random and as a result, they are sort of far from each other. This is why a less conservative 
procedure can still achieve a low FPR. When smaller subspaces are closer to each other or when 
the statistical model does not hold exactly as in real data scenarios, a conservative procedure may 
be more effective. In fact, experiments on real data in Section 6 confirm this and show that for the 
corrected Dantzig selector, one needs to choose values much larger than A D to yield good results. 



4.2 Variance calculation 



By definition, 

£j = (xj,Zi - Z/3 1 ) + (zj,Zi) - (zjzj - a 2 )f3j - £ zjz k ^ k := I x + J 2 + h + h- 

k.k+i,j 

A simple calculation shows that for l\ + £2, Cov(J^ 1 ,/^ 2 ) = so that 

E£ 2 = £Var(/,). 

l=\ 

We compute 

Vax(ii) = -(1 + 1/3*12,), Var(/ 3 ) = -2(/?/) 2 , 
n n 

Var(7 2 ) = -, Var(/ 4 ) = -[10% - (/3*) 2 ] 

n n J 

and (4.2) follows. 



5 Comparisons With Other Works 

We now briefly comment on other approaches to subspace clustering. Since this paper is theoretical 
in nature, we shall focus on comparing theoretical properties and refer to [16], [40] for a detailed 
comparison about empirical performance. Three themes will help in organizing our discussion. 

• Tractability. Is the proposed method or algorithm computationally tractable? 

• Robustness. Is the algorithm provably robust to noise and other imperfections? 

• Efficiency. Is the algorithm correctly operating near the limits we have identified above? In 
our model, how many points do we need per subspace? How large can the affinity between 
subspaces be? 

One can broadly classify existing subspace clustering techniques into three categories, namely, 
algebraic, iterative and statistical methods. 

Methods inspired from algebraic geometry have been introduced for clustering purposes. In this 
area, a mathematically intriguing approach is the generalized principal component analysis (GPCA) 
presented in [41]. Unfortunately, this algorithm is not tractable in the dimension of the subspaces, 
meaning that a polynomial-time algorithm does not exist. Another feature is that GPCA is not 
robust to noise although some heuristics have been developed to address this issue, see e.g. [26]. 
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As far as the dependence upon key parameters is concerned, GPCA is essentially optimal. An 
interesting approach to make GPCA robust is based on semidefinite programming [30]. However, 
this novel formulation is still intractable in the dimension of the subspaces and it is not clear how 
the performance of the algorithm depends upon the parameters of interest. 

A representative example of an iterative method — the term is taken from the tutorial [40] — is 
the K-subspace algorithm [36], a procedure which can be viewed as a generalization of K-means. 
Here, the subspace clustering problem is formulated as a non-convex optimization problem over the 
choice of bases for each subspace as well as a set of variables indicating the correct segmentation. 
A cost function is then iteratively optimized over the basis and the segmentation variables. Each 
iteration is computationally tractable. However, due to the non-convex nature of the problem, the 
convergence of the sequence of iterates is only guaranteed to a local minimum. As a consequence, 
the dependence upon the key parameters is not well understood. Furthermore, the algorithm can 
be sensitive to noise and outliers. Other examples of iterative methods may be found in [1,9,23,45]. 

Statistical methods typically model the subspace clustering problem as a mixture of degenerate 
Gaussian observations. Two such approaches are mixtures of probabilistic PCA (MPPCA) [35] 
and agglomerative lossy compression (ALC) [24]. MPPCA seeks to compute a maximum- likelihood 
estimate of the parameters of the mixture model by using an expected-maximization (EM) style 
algorithm. ALC searches for a segmentation of the data by minimizing the code length necessary 
(with a code based on Gaussian mixtures) to fit the points up to a given distortion. Once more, 
due to the non-convex nature of these formulations, the dependence upon the key parameters and 
the noise level is not understood. 

Many other methods apply spectral clustering to a specially constructed graph [2,8,13,18,43,46]. 
They share the same difficulties as stated above and [40] discusses advantages and drawbacks. An 
approach similar to SSC is called low-rank representation (LRR) [21]. The LRR algorithm is 
tractable but its robustness to noise and its dependence upon key parameters is not understood. 
The work in [20] formulates the robust subspace clustering problem as a non-convex geometric 
minimization problem over the Grassmanian. Because of the non-convexity, this formulation may 
not be tractable. On the positive side, this algorithm is provably robust and can accommodate 
noise levels up to 0(1/ (Ld 3 ^ 2 )). However, the density p required for favorable properties to hold 
is an unknown function of the dimensions of the subspaces (e.g. p could depend on d in a super 
polynomial fashion). Also, the bound on the noise level seems to decrease as the dimension d 
and number of subspaces L increases. In contrast, our theory requires p > p* where p* is a 
fixed numerical constant. While this manuscript was under preparation we learned of [17] which 
establishes robustness to sparse outliers but with a dependence on the key parameters that is super- 
polynomial in the dimension of the subspaces demanding p > Cod lo&n . (Numerical simulations 
in [17] seem to indicate that p cannot be a constant.) 

Finally, the papers [22,32,33] also address regression under corrupted covariates. However, there 
are three key differences between these studies and our work. First, our results show that LASSO 
without any change is robust to corrupted covariates whereas these works require modifications 
to either LASSO or the Dantzig selector. Second, the modeling assumptions for the uncorrupted 
covariates are significantly different. These papers assume that X has i.i.d. rows and obeys the 
restricted eigenvalue condition (REC) whereas we have columns sampled from a mixture model 
so that the design matrices do not have much in common. Last, for clustering and classification 
purposes, we need to verify that the support of the solution is correct whereas these works establish 
closeness to an oracle solution in an £2 sense. In short, our work is far closer to multiple hypothesis 
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testing. 



6 Numerical Experiments 

In this section, we perform numerical experiments corroborating our main results and suggesting 
their applications to temporal segmentation of motion capture data. In this application we are given 
sensor measurements at multiple joints of the human body captured at different time instants. The 
goal is to segment the sensory data so that each cluster corresponds to the same activity. Here, 
each data point corresponds to a vector whose elements are the sensor measurements of different 
joints at a fixed time instant. 

We use the Carnegie Mellon Motion Capture dataset (available at http://mocap.cs.cmu.edu), 
which contains 149 subjects performing several activities (data are provided in [47]). The motion 
capture system uses 42 markers per subject. We consider the data from subject 86 in the dataset, 
consisting of 15 different trials, where each trial comprises multiple activities. We use trials 2 
and 5, which feature more activities (8 activities for trial 2 and 7 activities for trial 5) and are, 
therefore, harder examples relative to the other trials. Figure 7 shows a few snapshots of each 
activity (walking, squatting, punching, standing, running, jumping, arms-up, and drinking) from 
trial 2. The right plot in Figure 7 shows the singular values of three of the activities in this trial. 
Notice that all the curves have a low-dimensional knee, showing that the data from each activity 
lie in a low-dimensional subspace of the ambient space (n = 42 for all the motion capture data). 




Walking 
Jumping 
Drinking 



15 20 



Figure 7: Left: eight activities performed by subject 86 in the CMU motion cap- 
ture dataset: walking, squatting, punching, standing, running, jumping, arms-up, and 
drinking. Right: singular values of the data from three activities (walking, jumping, 
drinking) show that the data from each activity lie approximately in a low-dimensional 
subspace. 

We compare three different algorithms: a baseline algorithm, the two-step procedure and the 
bias-corrected Dantzig selector. We evaluate these algorithms based on the clustering error. That 
is, we assume knowledge of the number of subspaces and apply spectral clustering to the similarity 
matrix built by the algorithm. After the spectral clustering step, the clustering error is simply 
the ratio of misclassified points to the total number of points. We report our results on half of 
the examples — downsampling the video by a factor 2 keeping every other frame — as to make the 
problem more challenging. (As a side note, it is always desirable to have methods that work well 
on a smaller number of examples as one can use split-sample strategies for tuning purposes). 6 

6 We have adopted this subsampling strategy to make our experiments reproducible. For tuning purposes, a random 
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(a) Trial 2. (b) Trial 5. 

Figure 8: Minimum clustering error (%) for each K in the baseline algorithm. 

As a baseline for comparison, we apply spectral clustering to a standard similarity graph built 
by connecting each data point to its .fT-nearest neighbors. For pairs of data points, yi and yj, 
that are connected in the if-nearest neighbor graph, we define the similarities between them by 
Wij = exp(-||i/j - yylH/t), where t > is a tuning parameter (a.k.a. temperature). For pairs of data 
points, yi and yj, that are not connected in the K- nearest neighbor graph, we set Wij = 0. Thus, 
pairs of neighboring data points that have small Euclidean distances from each other are considered 
to be more similar, since they have high similarity Wij. We then apply spectral clustering to the 
similarity graph and measure the clustering error. For each value of K, we record the minimum 
clustering error over different choices of the temperature parameter t > as shown in Figures 8a 
and 8b. The minimum clustering error for trials 2 and 5 are 17.06% and 12.47%. 

For solving the LASSO problems in the two-step procedure, we developed a computational 
routine made publicly available [37] based on TFOCS [7] solving the optimization problems in 
parallel. For the corrected Dantzig selector we use a homotopy solver in the spirit of [38]. 

For both the two-step procedure and the bias-corrected Dantzig selector we normalize the data 
points as a preprocessing step. We work with a noise a in the interval [0.001,0.045], and for 
each value of a, we vary A around 1/A = 4||/3*||^ and A a = \j2\no \/l + a 2 . After building the 
similarity graph from the sparse regression output, we apply spectral clustering as explained earlier. 
Figures 9a, 9b and 10 show the clustering error (on trial 5) and the red point indicates the location 
where the minimum clustering error is reached. Figures 9a and 9b show that for the two-step 
procedure the value of the clustering error is not overly sensitive to the choice of a — especially 
around A = A G . Notice that the clustering error for the robust versions of SSC are significantly 
lower than the baseline algorithm for a wide range of parameter values. The reason the baseline 
algorithm performs poorly in this case is that there are many points that are in small Euclidean 
distances from each other, but belong to different subspaces. 

Finally a summary of the clustering errors of these algorithms on the two trials are reported in 
Table 1. Robust versions of SSC outperform the baseline algorithm. This shows that the multiple 
subspace model is better for clustering purposes. The two-step procedure seems to work slightly 
better than the corrected Dantzig selector for these two examples. Table 2 reports the optimal 
parameters that achieve the minimum clustering error for each algorithm. The table indicates that 
on real data, choosing A close to A Q also works very well. Also, one can see that in comparison with 



strategy may be preferrable. 
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Figure 9: Clustering error (%) for different values of A and a on trial 5 using the two 
step procedure (a) 3D plot (minimum clustering error appears in red), (b) 2D cross 
sections. 
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Figure 10: Clustering error (%) for different values of A and a on trial 5 using the 
corrected Dantzig Selector. 
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Figure 11: Box plot of the affinities between subspaces for trials 2 and 5. 

the synthetic simulations of Section 4, a more conservative choice of the regularization parameter A 
is needed for the corrected Dantzig selector as A needs to be chosen much higher than A to achieve 
the best results. This may be attributed to the fact that the subspaces in this example are very 
close to each other and are not drawn at random as was the case with our synthetic data. To get 
a sense of the affinity values, we fit a subspace of dimension to the Ng data points from the £th 
group, where dg is chosen as the smallest nonnegative integer such that the partial sum of the dg 
top singular values is at least 90% of the total sum. Figure 11 shows that the affinities are higher 
than 0.75 for both trials. 

7 Discussion and Open Problems 

In this paper, we have developed a tractable algorithm that can provably cluster data points in a 
fairly challenging regime in which subspaces can overlap along many dimensions and in which the 
number of points per subspace is rather limited. Our results about the performance of the robust 
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: Minimum clustering 


error. 




Baseline algorithm 


Two-step procedure 


Corrected Dantzig selector 


Trial 2 


K=9, t=0.0769 


a = 0.03 , A = 1.25A D 


a = 0.004 , A = 41.5A 


Trial 5 


K=6, t=0.0455 


ct = 0.01, A= A 


a = 0.03 , A = 45.5A 



Table 2: Optimal parameters. 



SSC algorithm are expressed in terms of interpretable parameters. This is not a trivial achievement: 
one of the challenges of the theory for subspace clustering is precisely that performance depends 
on many different aspects of the problem such as the dimension of the ambient space, the number 
of subspaces, their dimensions, their relative orientations, the distribution of points around each 
subspace, the noise level, and so on. Nevertheless, these results only offer a starting point as 
our work leaves open lots of questions, and at the same time, suggests topics for future research. 
Before presenting the proofs, we would like to close by listing a few questions colleagues may find 
of interest. 

• We have shown that while having the affinities and sampling densities near what is information 
theoretically possible, robust versions of SSC that can accommodate noise levels a of order 
one exist. It would be interesting to establish fundamental limits relating the key parameters 
to the maximum allowable noise level. What is the maximum allowable noise level for any 
algorithm regardless of tractability? 

• It would be interesting to extend the results of this paper to a deterministic model where 
both the orientation of the subspaces and the noiseless samples are non-random. We leave 
this to a future publication. 

• Our work in this paper concerns the construction of the similarity matrix and the correctness 
of sparse regression techniques. The full algorithm then applies clustering techniques to clean 
up errors introduced in the first step. It would be interesting to develop theoretical guarantees 
for this step as well. A potential approach is the interesting formulation developed in [4]. 

• A natural direction is the development of clustering techniques that can provably operate with 
missing and/or sparsely corrupted entries (the work [34] only deals with grossly corrupted 
columns). The work in [17] provides one possible approach but requires a very high sampling 
density as we already mentioned. The paper [16] develops another heuristic approach without 
any theoretical justification. 

• One of the advantages of the suggested scheme is that it is highly parallelizable. When the al- 
gorithm is run sequentially, it would be interesting to see whether one can reuse computations 
to solve all the ^-minimization problems more effectively. 
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8 Proofs 



We prove all of our results in this section. Before we begin, we introduce some notation. If A is 
a matrix with iV columns and T a subset of {1, . . . ,N}, At is the submatrix with columns in T. 
Similarly, xt is the restriction of the vector x to indices in T. Throughout we use L m to denote 
logm up to a fixed numerical constant. The value of this constant may change from line to line. 
Likewise, C is a generic numerical constant whose value may change at each occurrence. 

Next, we work with y ■- y± for convenience, assumed to originate from S±, which is no loss of 
generality. It is also convenient to partition Y as {Y^ 1 ' , Y^ 2 \ . . . , Y^ L '}, where for each I, yW 
are those noisy columns from subspace Sf, when I = 1, we exclude the response y\ fromyW. With 
this notation, the problem (2.2) takes the form 



1 



mm 

(3eR N -i 



y-(y( 1 )/3« 







(i) 



+ ... + A 



/3< L > 



Throughout, y\\/YS denotes the projection of the vector/matrix y/Y^ onto Si. Similarly, we 
use y L to denote projection onto the orthogonal complement S^; hence, y = y« + y± and yW = 



-(I) 



+ Y} K Moreover, U\ e R nxd and e R nx ( n d ) are orthonormal bases for S\ and S^. 



x\\g 2 = 1 and E||z||||| 2 = a 2 d/n, it is obvious that under the stated 



Since y\\ = x + Z\\ with \\x\\g 2 = 1 and E||«|||| = a 2 
assumptions, ||y||||^ 2 € [3/4,5/4] with very high probability as shown in Lemma A. 4. The same 

applies to all the columns of Y« . From now on, we will operate under these two assumptions, 
which hold simultaneously over an event of probability at least 1-1/N. 



8.1 Intermediate results 

In this section, we record a few 
many true discoveries theorems, 
over this section rather quickly, 
technical lemmas below. 



important results that shall be 
Now the reader interested in 
and return to it once it is clear 



used to establish the no-false and 
our proofs may first want to pass 
how our arguments reduce to the 



8.1.1 Preliminaries 

Our first lemma rephrases Lemma 7.5 in [34] and bounds the size of the dot product between 
random vectors. We omit the proof. 

Lemma 8.1 Let A e [R d i xAr i fr e a matrix with columns sampled uniformly at random from the unit 
sphere of R dl , w e R d2 be a vector sampled uniformly at random from the unit sphere of R d2 and 
independent of A and S e R dixd2 be a deterministic matrix. We have 

II 5] II 

|A T Sw| < 0ogalog6- 11 " F ' Q ^ 



r d\\Jd~2 



with probability at least 1 — j=— 2 ^f- . 

We are interested in this because (8.2) relates the size of the dot products with the affinity between 
subspaces as follows: suppose the unit-norm vector X{ is drawn uniformally at random from Si, 
then 

X ij)T x t = A Tl Ew; 
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A and w are as in the lemma and I] = U^' where U^' (resp. U^') is an orthobasis for Sj 

(resp. Si). By definition, = a di &ff(Sj,Si). 

8.1.2 The first step of Algorithm 2 

As claimed in Section 2, the first step of Algorithm 2 returns an optimal value that is a reasonable 
proxy for the unknown dimension. 

Lemma 8.2 Let Val(Step l)be the optimal value of (2.4) with r = 2a. Assume p\ > p* as earlier. 
Then 

^a/ r^- < Val(Step 1) < 2^a\. (8.3) 
10 V logpi 

The upper bound holds with probability at least 1 - e 71 ™ - e~" l2dl . The lower bound holds with 
probability at least 1 - e~ ridl - ^ . 

rp rp — X 

Proof We begin with the upper bound. Let /3 = (X^X^ ) x be the minimum fo-norm 

solution to the noiseless problem X(3q = x. We show that (3q is feasible for (2.4). We have 

y - Yfo = x - X(3 + (z - ZA)) = « - Zjflo, 

which gives 



£(z - Z(3 \X,x)= Af(0, VI n ), V = (1 + ||/3 || £ 2 > 2 /n 
(the notation £(Y|A") is the conditional law of Y given X). Hence, the conditional distribution of 

II 2 

Wt 2 



\z - ZPoWj is that of a chi square and (A.l) gives 



\\z-Z[3oh 2 <\/2(l + \\[3o\\l)o- 
with probability at least 1 - e _7in . On the other hand, 



x 



e-2 



O-miniXW) 

and applying Lemma A. 2 gives 

IIA4 2 : ' 



Ni_ 2 



which holds with probability at least 1 -e 72fi(l . If N\ > 9di, then ||/3o|k 2 - 1 ana - thus Po is feasible. 
Therefore, 



lFl tl < \\Po\\ ei <V*hlMi a * nr ± 2xA, 



1 — O-l J±L 
1 ^' Ni 



where the last inequality holds provided that Aq > 16di. 

We now turn to the lower bound and let f3* be an optimal solution. Notice that ||j/|| - Yuj3* \\t 2 < 
|| y - Y(3* \\e 2 < 2a so that (3* is feasible for 

min \\p\\ ti subject to \\y [l -Y [l p\\ h <2a. (8.4) 
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We bound the optimal value of this program from below. The dual of (8.4) is 

max , v) - 2a subject to ^ < 1. (8-5) 

Slater's condition holds and the primal and dual optimal values are equal. To simplify notation set 



A = y;, (1) . Define 



v* s arg max (y\\ , is) subject to ||A T i/|| < 1. 



Notice that v* has random direction on subspace S\. Therefore, combining Lemmas 8.1, A.l, 
and B.4 together with the affinity condition implies that for £ + 1, \\Y^ v < 1 with high 
probability. In short, v* is feasible for (8.5). 

Since j/n has random direction, the arguments (with t = 1/6) in Step 2 of the proof of Theorem 
2.9 in [34] give 

/ *\ ^ 1 / di 



/2ne V log Pi 
Also, by Lemma B.4, 

16 / di 

W \\o< 



3 V logpi 

Since v* is feasible for (8.5), the optimal value of (8.5) is greater or equal than 



/ *\ o II *n ^ 1 i ^ 



10 V logpi 

where the inequality follows from the bound on the noise level. This concludes the proof. 



8.1.3 The reduced and projected problems 

When there are no false discoveries, the solution to (8.1) coincides with that of the reduced problem 



€ argmin- y-Y (1) (3 (1) 



+ X 



13 



(i) 



(8.6) 



Not surprisingly, we need to analyze the properties of the solution to this problem. In particular, 
we would like to understand something about the orientation and the size of the residual vector 

A problem close to (8.6) is the projected problem 



(3^ e argmin- yu -Y} ±J (3 



-d)o(i) 



C-2 



+ \ 







(i) 



•7) 



The difference with the reduced problem is that the goodness of fit only involves the residual sum 
of squares of the projected residuals. Intuitively, the solutions to the two problems (8.6) and (8.7) 
should be close. Our strategy is to gain some insights about the solution to the reduced problem 
by studying the properties of the projected problem. 
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8.1.4 Properties of the projected problem 

The sole purpose of this subsection is to state this: 

Lemma 8.3 Let (3^ be any solution to the projected problem and assume that N\/di > p* as 
before. Then there exists an absolute constant C such that for all A > 0, 







(i) 



<C 



holds with probability at least 1 - 5e lldl - e ^ Nldl . 

This estimate shall play a crucial role in our arguments. It is a consequence of sharp estimates 
obtained by Wojtaszczyk [42] in the area of compressed sensing. As not to interrupt the flow, we 
postpone its proof to Section 8.3. 

In the asymptotic regime (pi - Ni/d\ fixed and di -*■ oo), one can sharpen the upper bound 
(8.8) by taking (7=1. This leverages the asymptotic theory developed in [5] and [6] as explained 
in Appendix C. 

8.1.5 Properties of the reduced problem 

We now collect two important facts about the residuals to the reduced problem. The first concerns 
their orientation. 

Lemma 8.4 (Isotropy of the residuals) The projection of the residual vector r = y- yW/^ 1 ) 
onto either S\ or has uniform orientation. 

Proof Consider any unitary transformation U" (resp. U 1 ) leaving Si (resp. S^) invariant. Since 



+ A 







(i) 



(!)«(!) 



f-2 



+ A 







(i) 



the LASSO functional is invariant and this gives 

/3 (1 H^^||^V,^^ | (1 \^^ (1) ) = /3 (1 Hy||,^^ | (1) ,n (1) )- 

Let r"" , y^| ( ' 1 - ) ) = y\\-Y^ (3^ and r ± (y 1 , Y]_ ) = y±-Y± 1 ' fl^ be the projections of the residuals. 
Since y\\ and yS are invariant under rotations leaving Si invariant, we have 

^-U^^Hy^Y^^) 
~y,-Y^^\y h y L ^\Y^) 
= r\y h Y^), 

where X ~ Y means that the random variables X and Y have the same distribution. Therefore, 
the distribution of r" (y\\ , Y\\ ) = y\\ ~ yS 1 ^ (3^ is invariant under rotations leaving Si invariant. 
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In other words, the projection r>> has uniform orientation. In a similar manner we conclude that 
r 1 has uniform orientation has well. ■ 

The next result controls the size of the residuals. 
Lemma 8.5 (Size of residuals) If Nijd\ > p* , then for all A > 0, 



< Co-. 



Also, 



3 Vlog(iViM) 



+ Ca. 



(8.9) 



U0) 



Both these inequalities hold with probability at least 1 - e 71 ( n - 5e 72rfl - e ^ Nldl ; where 71 and 
72 are /ixed numerical constants. Thus if A > a/\/8d\, then 



V\\ 



f2 



Proof We begin with (8.9). Since is optimal for the reduced problem, 



y-F (1) /3 (1) +A $ 



< 
2 



y-F (1) /3 (1) +A /3 



^2 



1(1) 



Conversely, since is optimal for the projected problem, 



(D/q(l) 



+ A 



/3 



(i) 



< - 
li 2 



y,|-^ (1) /3« ]+X$ 



l(i) 



Now Parseval equality 



2/11 -*r^ 



a)fl(i) 



(2 



(and similarly for /3^') together with the last two inequalities give 



ill) 



(i)«(i) 



^2 



Now observe that y||, y ± , 5j and are all independent from each other. Since is a 
function of y\\ and Yj^, it is independent from y L and Hence, 

Conditionally then, it follows from the chi-square tail bound (A.l) with e = 1 that 



y± 



^ ( t,^t-!)(HK,) 



which holds with probability at least 1 - e 71 ( n where 71 = ^— Unconditionally, our first 
claim follows from Lemma 8.3. 
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We now turn our attention to (8.10). Our argument uses the solution to the noiseless projected 
problem 



/3« 



e arg mm 



„ subject to y\\ = ^, (1) /3 (1) 



(8.12) 



this is the solution to the projected problem as A -> + . With this, we proceed until (8.13) as 
in [10,11]. Since is optimal for the reduced problem, 



< 

h 2 



+ A 



/3« 



Put h = - j3^\ Standard simplifications give 



(2 



+ X 



Letting S be the support of we have 



+ =||j95 + M/ + > /3 (1) , +(sgn(^),M + ll^ 



This yields 



Y-W/i , + A||/i 5c || £i <(y-r( 1 )^ 1 ),FW^)-A(sgn(4 1) ),/ ls ). 



«2 



By definition, y« - = 0, and thus 



y\\ 
1 

2 



(8.13) 



Continue with 

(Y^h,y ± -Y^0^}< 
so that 



+ A||% C |U<J yi-F^^ 1 ) * -A(sgn(4 1) ) ) / l5 >. 



yW^l) 



1 

< - 
h 2 



2 1 
+ - 

<?2 2 



(■2 



(8.14) 



Now set A = yj| for notational convenience. Since is optimal, there exists v such that 



Also, Corollary B.4 gives 

With this 
We have 



v = AV, d 5 = sgn(/3^ 1} ) and 11^11^, < 1- 



„2 256 ^1 
V 1 < 



9 Iog(JVi/di) - 



(8.15) 



(sgn(/3^ 1) ),/i 5 ) = (v s ,h s ) = (v,Ah)-(v S c,h S c). 



\{u,Ah)\ < \\Ah\\ h M t2 < ^ \\Ah\\ + \\\v\] 2 
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and, therefore, 



\\( S gn(pW),h s )\ < \ \\Ah\l + A 2 \\vf l2 + A 



Plugging this into (8.14), we obtain 



1 



1 



- AMI < - 



t'2 



+ A 2 Ml 



5.16) 



This concludes the proof since by definition Ah = y\\ - yP~' and since we already know that 

l|yi-^ (1) ^ (1) lll<cV. ■ 



8.2 Proof of Theorem 3.1 

First, Lemma 8.2 asserts that by using r and f{t) as stated, our choice of A obeys 

A >ir (ai7) 

All we need is to demonstrate that when A is as above, there are no false discoveries. To do this, 
it is sufficient to establish that the solution /3W to the reduced problem obeys 



Y^ T {y-Y^^) 



<A, foralU*!. (8.18) 



This is a consequence of this: 

Lemma 8.6 Fix A e R dxN and Tc {1,2,..., N}. Suppose that there is a solution x* to 

1 2 



mm — \\y - Ax\\g 2 + A lasl^ subject to xt c = 



obeying \\A^ c (y - Acc*)^ < A. Then any optimal solution x to 



■ 1 II A l|2 x || II 

mm - \y - Ax\^ 2 + A ||a?|| 



must also satisfy xt c = 0. 



Proof Consider a perturbation x* + th. For t > sufficiently small, the value of the LASSO 
functional at this point is equal to 

i \\y - A(x* + th)\\l + X\\x + th\\ h =\\y- Ax*\\ 2 h - t(A T (y -Ax*),h) + ^ \\Ah\\l 

+ A \\x\\ ei + Xt(sgn(x T ) , h T ) + At ||/it c ||4 • 

Now since the optimality conditions give that A^,(y - Ax*) = Asgn(ajT) and that by assumption, 
A?rc(y - Ax*) = Ae^c with Hei^H&x, < 1 ; the value of the LASSO functional is equal to 



1„ . .,,2 t 



2 



y- Ax*\\ l2 +A||sc|^ + — \\Ah\\ h + Xt(\\h TC \\ ei -(e T ^h T c)). 
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Clearly, when t 0, the value at x* + th is strictly greater than that at x* , which proves the 
claim. ■ 

We return to (8.18) and write 



<£) J 



T . 



+ Z^ T ( yil -Y^0W) + Z^\ yi -Y^^). 



To establish (8.18), we shall control the loo norm of each term by using Lemma 8.1 and the estimates 
concerning the size of the residuals. For ease of presentation we assume d\ > di and dg < n-d\ — the 
proof when dg > d\ is similar. 

The term X^ 7 \y {{ - Y, (1) /3W). Using Lemma 8.1 with a = y/ThgN, b = 2y/EgN, we have 



X^( yil -Y^^>) 



<781ogiV aff( ^ ) 



y\\ - y, re 



£2 



this holds uniformly over £ + 1 with probability at least l - ^- Now applying Lemma 8.5 we conclude 
that 



xW r (y |r >f>/3«) 
The term X^ T (y L - Y^ 1} 0^). As before, 



< XL N aS(Si,S e ) := Xh. 



T 



< \/81ogA^ — y L - 1^ (1) /3 (1) 

y/n-di 



which holds uniformly over £ # 1 with probability at least 1 - (we used the fact that the affinity 
is at most one.) Applying Lemma 8.5 gives 



X^V-^ (1) /3 (1) ) 



T ° T 

^ j->n '■- h- 
>n 



The terms Z^ T (y {{ - Y^ 1 and Z^ T (y ± - Y^fi^). Since Z^ is a Gaussian matrix 
with entries M(0,a 2 /n), applying Lemma A.l gives 



< 2a 



< CXa 



logiV 



■n 



y\\-Yrp 



(!)«(!) 



di IorN 



n 



■=Xh 



with probability at least 1 - 2 



v 



In a similar fashion with probability at least 1 - , we have 



Z(*>V-^ (1) /3 (1) ) 
Putting all this together, we need 



< a' 



n 



:=h. 



(ii + h)X + I 2 + h < X 
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to hold with high probability. It is easy to see that the affinity condition in Theorem 3.1 is equivalent 
to 1% + J3 < 1 — t=. Therefore it suffices to have A > V3(h + Z4). The latter holds if 



x/3 

A > L N - 

n 



The calculations have been performed assuming (8.17). Therefore, it suffices to have 

A> v^ max ( 1,L7v v?)- 

The simplifying assumption d\ < n/Lfj at the beginning of the paper concludes the proof. 

8.3 The size of the solution to the projected problem 

This section proves Lemma 8.3. We begin with two definitions. 

Definition 8.7 (Inradius) The inradius r(V) of a convex body V is the radius of the largest 
Euclidean ball inscribed in V. 

Definition 8.8 (Restricted isometry property (RIP)) We say that A e R dxN obeys RIP(s,8) 
*f 

(!-<*) Mia ^ W Ax Wl2 ^( 1 + 6 ) \\ x \\e 2 
holds for all s-sparse vectors (vectors such that \\x\\g < s). 

We mentioned that Lemma 8.3 is essentially contained in the work of Wojtaszczyk and now 
make this clear. Below, x is any optimal solution to 

min || a; || ^ subject to y - Ax. (8.19) 



Theorem 8.9 [42, Theorem 3.4] Suppose A e R dxN obeys RIP(s,6) and r(V(A)) > where 
V(A) is the symmetrized convex hull of the columns of A. Then there is a universal constant 
C = C{5,[i), such that for any solution x to y = Ax, we have 

\\x-x\\ h <C\\x-x {s) \\ e2 + C\\y-Ax {s) \\ £2 . (8.20) 

Above, xr s \ is the best s-sparse approximation to x (the vector x with all but the s-largest entries 
set to zero). 

We now prove Lemma 8.3 and begin with A = 0. The expected squared Euclidean norm of a 
column of is equal to 1 + a 2 d/n. Rescaling ^ as A = (1 + a 2 d/n)~ 1 ^ 2 Y^ 1 \ it is a simple 

calculation to show that with s = y/di/L Nl / dl (recall that L Nl j dl is a constant times log(iVi/di)), 
A obeys RIP(,s,(5) for a fixed numerical constant 5. For the same value of s, a simple rescaling of 
Lemma B.3 asserts that r(T > (A)) > nj\/s for a fixed numerical constant \x. 
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Now let be the pseudo-inverse of A, and set (3 = A^y\\ . First, \y\\ ||^ 2 e [3/4, 5/4] and second 
\(3\i 2 < 1 as shown in Lemma 8.2. Thus, 

/3« - (3\\ e2 < C \\(3\\ h + C ||y„ - A/3 (s) \\ h < % -C + C(l + <5). 

The second inequality comes from the RIP property ||A/3( S )||^ 2 < (1 + <5)||/3( s )||£ 2 - (1 + This 
completes the proof for A = 0. For A > 0, one simply applies the same argument with yp replaced 

bylf^ 1 ). 

8.4 Proof of Theorem 3.2 

Once we know there are no false discoveries, the many many-true-discovery result becomes quite 
intuitive. The reason is this: we already know that the residual sum of squares ||y|| - Y^/3^ \\"j 2 is 
much smaller than one provided A is not too large — this is why we have the upper bound /(t) < ao/t. 
Now recall that y« is a generic point in a d-dimensional subspace and, therefore, it cannot be well 
approximated as a short linear combination of points taken from the same subspace. It is possible — 
and not difficult — to make this mathematically precise and thus prove Theorem 3.2. Here, we take 
a shorter route re- using much of what we have already seen and/or established. 

We rescale Yj, (1) as A = (1 + a 2 dln)- l l 2 Y^ l) to ensure that the expected squared Euclidean 
norm of each column is one, and set s = y/di/L^/^. We introduce three events E\, Ei and E3. 

• E\\ there are no false discoveries. 

• E2: with s as above, the two conditions in Theorem 8.9 hold. 

• E3 - ||^ > ci \Jd\l\ogp\} for some numerical constant c\. 

bmce f(t) > <r/(V2t), there are no false discoveries with high probability. Further, in the proof of 
this theorem we established that E\ and E% hold with high probability. Last, E% also has large 
probability. 

Lemma 8.10 Let /(f) be as in Theorem 3.2, then the event E% has probability at least l-e~ 73<il -j? . 

Proof The proof is nearly the same as that of the lower bound in Lemma 8.2. By definition, the 
point 0^ is a solution to 

min||/3||^ subject to y-Y^(3 <r 

with t - ||y - Y^ 0^\\i 2 . Hence, if we can take r to be sufficiently smaller than one, then the 
argument in the proof of Lemma 8.2 can be copied to establish the claim. 
Moving forward, Lemma 8.5 gives 



12 3 V log Pi 



< 32 a . ( < a 
~ 3 Val(Step 1) V lo SPi 

< 110a + Ccr. 



7 Notice that the properties needed for this lemma to hold are the same as that of Lemma 8.2 and, therefore, E3 holds 
under the conditions of the no-false discovery theorem. Hence, the guaranteed probabilities of success of our two 
main theorems are the same. 
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The second inequality follows from the definition of the regularization parameter since A < ao/Val(Step 1) 
and the third from (8.3) in Lemma 8.2. We have proved the claim provided ao and a are sufficiently 
small (this is the place where the assumption about a comes into play). ■ 



Lemma 8.11 On the event E<± n IUo - s }> where s is as above, 



(3 



(i) 



<c 2 



for some numerical constant c%- 



Proof We apply Theorem 8.9 one more time with A as before, y - (1 + a 2 d/n) 1 ^ 2 y\\, x = f3^ 
and x = Z^ 1 ). By assumption, xr s \ = x, and 



- {1) \\ e < C(l + a'd/ny 1 ' 2 \\ y]] - Y^pM 



< c 



(the second inequality comes from Lemma 8.5). The proof follows by applying the triangular 
inequality and Lemma 8.3 with A = 0. ■ 

The proof of Theorem 3.2 is now straightforward. With s as above, there is nothing to do if 
|| £ - s > so we assume \\/3\\e < s. On this event and E\ n E% n E3, 



V log Pi 



h > c 31 / dl 



Cauchy-Schwartz asserts that ||/3 (1) ||^ < yl$^ho\\0 W \U 2 and, therefore, 

$M n >Cdi/logpi. 



A Standard inequalities in probability 

This section collects standard inequalities that shall be used throughout. The first concerns tails 
of chi-square random variables: a chi-square Xn with n degrees of freedom obeys 



P(X^ > (1 + e)n) < exp(- 



(l-log2) 2 



ne 



)■ 



(A.1) 



The second concerns the size of the dot product between a fixed vector and Gaussian random 
verctors. 

Lemma A.l Suppose A in R dxN has iid Af(0,l) entries and let z e R d a unit-norm vector. Then 

\\A T z\\ f <2VlogiV 

11 "too 



with probability at least 1- jr. (This also applies if z is a random vector independent from A.) 
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Lemma A. 2 (Sub-Gaussian rows [39]) Let A be an N x d matrix (N > d) whose rows are 
independent sub-Gaussian isotropic random vectors in V d . Then for every t > 0, 

cr min ( A) > VN - CVd - t 

with probability at least 1 - e~ c< . Here, o" m ; n is the minimum singular value of A and C = Cjc, 
c- ck > depend only on the sub- Gaussian norm K = maxj ||>lj||# 2 of the rows (see [39]). 

Lemma A. 3 With probability at least 1 - e~ dl l 2 , 

Proof This is a trivial consequence of Lemma A. 2 above with t = \fd\. ■ 

Lemma A. 4 If a and d\ are as in Section 1.2, all the columns in and y\\ have Euclidean 

norms in [3/4,5/4] with probability at least 1- (For a single column, the probability is at least 
equal to 1 - ■) 

Proof A column of or y\\ is of the form a = x + z\\ where x is uniform on the unit sphere of 
Si and z ~ jV(0, (a 2 /n)I n ). We have 

\\ x \\e 2 -\\ z \\l 2 ^\\ a h 2 ^\\ x \\e 2 + \\ z \\\\ e2 - 

The result follows from ||a;||^ 2 = 1 and ||z|| ||^ 2 < |, which holds with high probability. The latter is a 
consequence of (A.l) since ||*||||f (properly normalized) is a chi-square with d\ degrees of freedom 
and the bounds on a and d\ from (1.2). ■ 



B Geometric Lemmas 

Consider the linear program 

x* e argmin ||a;|L subject to y - Ax, (B-l) 

X 

and its dual 

i/*eargmax (y,v} subject to ||A r i/|| < 1. (B.2) 
Lemma B.l Any dual feasible point obeys 

1 

Wis L < . 

11 Ui ~ r(V(A)) 
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Proof Put r = r(V(A)) for short. By definition, there exists x with \x\g il < 1 such that Ax = rv. 
Now, 

r\\u\\ h = (Ax, v) = (x,A T u) < \\x\\ h \\A T u\\ £txi < 1. 



Strong duality Ix*!^ = (y,u*) < \\y\\£ 2 \\ u * \\h a ^ so gives: 
Lemma B.2 Any optimal solution x* to (B.l) obeys 

I y II 4, 



F Ik < 



Cl " r(P(A)) 
Lemma B.3 Assume p\ = N\jd\ > p* . Then 



with probability at least 1 - ^api _ e 



Proof Suppose A e [R rfxJV has columns chosen uniformly at random from the unit sphere of R d 
with p = N/d > pq. Then [34, Lemma 7.4] 



1. /\og(N/d)} 



P{r(P(A))<-\ *\' ' <e 



4 V d J 
ower bound on 1 

(Lemma A. 4) together with the fact that they have uniform orientation. 



The claim in the lemma follows from the lower bound on the Euclidean norm of the columns of 



Corollary B.4 With high probability as above, any dual feasible point to (8.12) obeys 

V 1 < 



9 log(iVi/di)" 

C Sharpening Lemma 8.3 Asymptotically 

Here, we assume that the ratio pi = Ni/d\ is fixed and JVi -*■ oo. In this asympotic setting, it is 
possible to sharpen Lemma 8.3. Our arguments are less formal than in the rest of the paper. 
Let xq £ VR N be an unknown vector, and imagine we observe 

y = Ax + z, 

where A is a d x N matrix with i.i.d. A/"(0, 1/d) entries, and z ~ Af(0, o~ 2 Id)- Let x be the solution 
to 

■ 1 II A l|2 x || || 

mm - \\y - Ax\ h + A \\x\\ £l . 
Then setting 5 = d/N, the main result in [5,6] states that almost surely, 

1 2 

lim — \\x - x \\l = HiviXo + T*Z;ar*) - X ] } = 5{tI - cr 2 ), 

N^oc, iv 
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where Z ~ A/*(0, 1) and the random variable Xq has the empirical distribution of the entries of xq. 
In addition, Z and Xq are independent. We refer to [5,6] for a precise statement. Above, a and r* 
are solutions to 



A 

2 



an 



l-^E{r)'(X + nZ;an)}] 

a 2 + -E[r](X + nZ-ar*) - X ] 2 
d 



(C.l) 
(C.2) 



Here, rj(x, 9) is applying a soft-thresholding rule elementwise. For a scalar t, this rule is of the form 

i](t, 9) = sgn(t) max(|i| - 9, 0). 

T (1) 

We apply this in the setting of Lemma 8.3 with xq = 0, Xq = 0. Here, A - U 1 Yjr and with 

abuse of notation y ■- Ufy\\ . In the asymptotic regime the vector y and the columns of A are 
both random Gaussian vectors with variance of each entry equal to 1/di + 1/n. Since the LASSO 
solution is invariant by rescaling of the columns and we are interested in bounding its norm, we 
assume without loss of generality that y and A have A/"(0, 1/d) entries M(0, 1/d), i.e. the variance 
of the noise z above is 1/d. With this, the above result simplifies to 

hm ^-WxWl^m^Z^T^fj^^-a 2 ), a 2 = l/d. 

To find a and t*, we solve 



A = an 



1, 



1 - -E{?j'(r t Z;«r t )} 
o 



, T 2 = a 2 + -E[i 1 (T*Z;aT*)] 2 . 
o 



Now notice that 

E{rj'(nZ;an)} - 2P{Z > a}, E[t)(t*Z; an)] 2 = r 2 E[ V (Z;a)] 2 . 
The equations then become 

2 r 



A = «T» 



1- -P{Z>a\ 
o 



1 



E[r,{Z-a)f 



Eliminating and solving for a yields 



Aa/(1- - E[rKZ; a)] 2 ) = aa 



1_ -p{z>a} 
o 



This one-dimensional nonlinear equation can be solved with high accuracy. Plugging in the solution 
in the expression for r* bounds the li norm of the solution. 

Now we explain how these relationships can be used to show ||£||^ 2 < 1 for p> p* as A -> 0. The 
argument for any A > follows along similar steps, which we avoid here. As A tends to zero we 
must have 

9 A 
= 1- -P{Z > a} => P{Z > a} = - => a = V^etfc' 1 ^) 
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O»,<5*) = (0.9254 , 0.35476) 







1 2 3 4 5 

a 



Figure 12: Left-hand side (blue) and right-hand side (red) of (C.3). The two curves 
intersect at (a*, 5*) = (0.9254,0.35476). 



where erfc 1 is the inverse of erfc(x) = 4= f™ e~ l dt. With this, we obtain 

o E[ V (Z;a)] 2 E[ V (Z;a)] 2 



\x\\l = N5(t; - a 1 ) = N5a 2 



lh " ' 6-E[r)(Z;a)¥ 5 - E[rj(Z; a)] 2 

Some algebraic manipulations give 

E[rj(Z; a)] 2 = (a 2 + l)erfc(a/\/2) - OL^e^ 12 = (a 2 + 1)5 - a^e^ 2 / 2 , 

where a = \/2erftr 1 (5). For the bound to be less than 1 it suffices to have E[r/(Z;a)] 2 < 5/2. After 
simplification, this is equivalent to 

5 = eTfc(a/V2) < J* e^l 2 . (C.3) 

V vr a z + 1/2 

The two functions on both sides of the above inequality are shown in Figure 12. As can be seen, 
for 5 < 0.35476 we have the desired inequality. This is equivalent to N\jdi = pi > p* = 2.8188. 



Acknowledgements 

E. C. is partially supported by AFOSR under grant FA9550-09-1-0643 and by ONR under grant N00014-09- 
1-0258 and by a gift from the Broadcom Foundation. M.S. is supported by a Benchmark Stanford Graduate 
Fellowship. We thank Rene Vidal for helpful discussions as well as Rina Foygel and Lester Mackey for a 
careful reading of the manuscript and insightful comments. E. C. would like to thank Chiara Sabatti for 
invaluable feedback on an earlier version of the paper. He also thanks the organizers of the 41st annual 
Meeting of Dutch Statisticians and Probabilists held in November 2012 where these results were presented. 
A brief summary of this work was submitted in August 2012 and presented at the NIPS workshop on Deep 
Learning in December 2012. 



References 

[1] P.K. Agarwal and N.H. Mustafa, fc-means projective clustering. In Proceedings of the twenty-third ACM 
SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 155-165, 2004. 



38 



[2] E. Arias-Castro, G. Chen, and G. Lerman. Spectral clustering based on local linear approximations. 
Electronic Journal of Statistics, 5:1537-1587, 2011. 

L. Bako. Identification of switched linear systems via sparse optimization. Automatica, 2011. 

M.F. Balcan, A. Blum, and A. Gupta. Approximate clustering without the approximation. In Proceed- 
ings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1068-1077. Society 
for Industrial and Applied Mathematics, 2009. 

M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to 
compressed sensing. IEEE Transactions on Information Theory, 57(2):764-785, 2011. 

M. Bayati and A. Montanari. The LASSO risk for Gaussian matrices. IEEE Transactions on Information 
Theory, 58(4):1997 -2017, April 2012. 

S.R. Becker, E.J. Candes, and M.C. Grant. Templates for convex cone problems with applications to 
sparse signal recovery. Mathematical Programming Computation, pages 1-54, 2011. 

T.E. Boult and L. Gottesfeld Brown. Factorization-based segmentation of motions. In Proceedings of 
the IEEE Workshop on Visual Motion, pages 179-186, 1991. 

P.S. Bradley and O.L. Mangasarian. fc-plane clustering. Journal of Global Optimization, 16(l):23-32, 
2000. 

E. J. Candes and Y. Plan. Near-ideal model selection by l\ minimization. The Annals of Statistics, 
37(5A):2145-2177, 2009. 

E. J. Candes and Y. Plan. A probabilistic and RIPless theory of compressed sensing. IEEE Transactions 
on Information Theory, (99):1-1, 2010. 

E. J.. Candes and T. Tao. The Dantzig Selector: Statistical estimation when p is much larger than n. 
The Annals of Statistics, 35(6):2313-2351, 2007. 

G. Chen and G. Lerman. Spectral Curvature Clustering (SCC). International Journal of Computer 
Vision, 81(3):317-330, 2009. 

Y. Chen, N.M. Nasrabadi, and T.D. Tran. Hyperspectral image classification using dictionary-based 
sparse representation. IEEE Transactions on Geoscience and Remote Sensing, (99):1-13, 2011. 

E. Elhamifar and R. Vidal. Sparse subspace clustering. In IEEE Conference on Computer Vision and 
Pattern Recognition, CVPR 2009., pages 2790-2797. 

E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications, to appear 
in IEEE Transactions on Pattern Analysis and Machine Intelligence, arXiv preprint arXiv:1203.1005, 
2012. 

B. Eriksson, L. Balzano, and R. Nowak. High-rank matrix completion and subspace clustering with 
missing data. Arxiv preprint arXiv:1112.5629, 2011. 

A. Goh and R. Vidal. Segmenting motions of different types by unsupervised manifold clustering. In 
Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on, pages 1-6. IEEE, 
2007. 

Y.P.C. Kotropoulos and G.R. Arce. £i-graph based music structure analysis. In International Society 
for Music Information Retrieval Conference, ISMIR 2011. 

G. Lerman and T. Zhang. Robust recovery of multiple subspaces by geometric t p minimization. The 
Annals of Statistics, 39(5):2686-2715, 2011. 

G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank 
representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171-184, Jan. 
2013. 



39 



[22] P.L. Loh and M.J. Wainwright. High-dimensional regression with noisy and missing data: Provable 
guarantees with nonconvexity. The Annals of Statistics, 40(3):1637~1664, 2012. 

[23] L. Lu and R. Vidal. Combined central and subspace clustering for computer vision applications. In 
Proceedings of the 23rd international conference on Machine learning, pages 593-600. ACM, 2006. 

[24] Y. Ma, H. Derksen, W. Hong, and J. Wright. Segmentation of multivariate mixed data via lossy data 
coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(9):1546- 
1562, 2007. 

[25] Y. Ma and R. Vidal. Identification of deterministic switched arx systems via identification of algebraic 
varieties. Hybrid Systems: Computation and Control, pages 449-465, 2005. 

[26] Y. Ma, A.Y. Yang, H. Derksen, and R. Fossum. Estimation of subspace arrangements with applications 
in modeling and segmenting mixed data. SI AM review, 50(3):413-458, 2008. 

[27] B. McWilliams and G. Montana. Subspace clustering of high-dimensional data: a predictive approach. 
Arxiv preprint arXiv: 1203. 1065, 2012. 

[28] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in 
neural information processing systems, 2:849-856, 2002. 

[29] N. Ozay, M. Sznaier, and C. Lagoa. Model (in) validation of switched arx systems with unknown 
switches and its application to activity monitoring. In 49th IEEE Conference on Decision and Control 
(CDC), pages 7624-7630. IEEE, 2010. 

[30] N. Ozay, M. Sznaier, C. Lagoa, and O. Camps. GPCA with denoising: A moments-based convex 
approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3209- 
3216. IEEE, 2010. 

[31] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. ACM 
SIGKDD Explorations Newsletter, 6(1):90-105, 2004. 

[32] M. Rosenbaum and A.B. Tsybakov. Sparse recovery under matrix uncertainty. The Annals of Statistics, 
38(5):2620-2651, 2010. 

[33] M. Rosenbaum and A.B. Tsybakov. Improved matrix uncertainty selector. arXiv preprint 
arXiv: 11 1244.13, 2011. 

[34] M. Soltanolkotabi and E. J. Candes. A geometric analysis of subspace clustering with outliers. The 
Annals of Statistics, 40(4):2195-2238, 2012. 

[35] M.E. Tipping and CM. Bishop. Probabilistic principal component analysis. Journal of the Royal 
Statistical Society: Series B (Statistical Methodology), 61(3):611-622, 1999. 

[36] P. Tseng. Nearest q-flat to m points. Journal of Optimization Theory and Applications, 105(l):249-252, 
2000. 

[37] stats . Stanford. edu/~candes/RSC. 

[38] users . ece .gatech. edu/~sasif /homotopy. 

[39] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. 

Chapter 5 of the book Compressed Sensing, Theory and Applications, ed. Y. Eldar and G. Kutyniok, 
2012. 

[40] R. Vidal. Subspace clustering. Signal Processing Magazine, IEEE, 28(2):52-68, 2011. 

[41] R. Vidal, Y. Ma, and S. Sastry. Generalized Principal Component Analysis (GPCA). IEEE Transactions 
on Pattern Analysis and Machine Intelligence, 27(12):1945-1959, 2005. 



40 



[42] P. Wojtaszczyk. Stability and instance optimality for Gaussian measurements in compressed sensing. 
Foundations of Computational Mathematics, 10(1): 1-13, 2010. 

[43] J. Yan and M. Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, 
non-rigid, degenerate and non-degenerate. ECCV 2006, pages 94-106, 2006. 

[44] A. Zhang, N. Fawaz, S. Ioannidis, and A. Montanari. Guess who rated this movie: Identifying users 
through subspace clustering. Arxiv preprint arXiv:1208.1544, 2012. 

[45] T. Zhang, A. Szlam, and G. Lerman. Median fc-flats for hybrid linear modeling with many outliers. 
In IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pages 
234-241, 2009. 

[46] T. Zhang, A. Szlam, Y. Wang, and G. Lerman. Hybrid linear modeling via local best-fit flats. Interna- 
tional Journal of Computer Vision, pages 1-24, 2012. 

[47] F. Zhou, F. Torre, and J. K. Hodgins. Aligned cluster analysis for temporal segmentation of human 
motion. In 8th IEEE International Conference on Automatic Face and Gesture Recognition, 2008. 
FG'08., pages 1-7, 2008. 



41 



